text extraction
Title block detection and information extraction for enhanced building drawings search
Lombardi, Alessio, Duan, Li, Elnagar, Ahmed, Zaalouk, Ahmed, Ismail, Khalid, Vakaj, Edlira
The architecture, engineering, and construction (AEC) industry still heavily relies on information stored in drawings for building construction, maintenance, compliance and error checks. However, information extraction (IE) from building drawings is often time-consuming and costly, especially when dealing with historical buildings. Drawing search can be simplified by leveraging the information stored in the title block portion of the drawing, which can be seen as drawing metadata. However, title block IE can be complex, especially when dealing with historical drawings which do not follow existing standards for uniformity. This work compares existing methods for this kind of IE task, and then proposes a novel title block detection and IE pipeline which outperforms existing methods, in particular when dealing with complex, noisy historical drawings. By combining a lightweight Convolutional Neural Network with GPT-4o, the proposed inference pipeline detects building engineering title blocks with high accuracy, and then extracts structured drawing metadata from the title blocks, which can be used for drawing search, filtering, and grouping. The work demonstrates high accuracy and efficiency in IE for both vector (CAD) and hand-drawn (historical) drawings. A user interface (UI) that leverages the extracted metadata for drawing search is established and deployed on real projects, which demonstrates significant time savings. Additionally, an extensible domain-expert-annotated dataset for title block detection is developed, via an efficient AEC-friendly annotation workflow that lays the foundation for future work.
- Europe > United Kingdom > England > West Midlands > Birmingham (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Europe > Portugal > Porto > Porto (0.04)
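The two-stage pipeline described in the abstract above (CNN-based title block detection followed by GPT-4o metadata extraction) can be sketched as follows. All function names are hypothetical and both model calls are stubbed, since the detector weights and the actual GPT-4o prompt are not given in the abstract; only the overall detect-then-extract shape is taken from the paper.

```python
# Sketch of a two-stage title-block pipeline: detect the title block
# region, then extract structured metadata from it as JSON.
import json

def detect_title_block(drawing):
    """Stand-in for the lightweight CNN detector: returns a bounding
    box (x, y, w, h) for the title-block region of the drawing."""
    return {"bbox": (1800, 1200, 400, 300)}

def extract_metadata(title_block_region):
    """Stand-in for the GPT-4o call: in practice the cropped title
    block image is sent with a prompt asking for structured JSON."""
    response = ('{"drawing_no": "A-101", '
                '"title": "Ground Floor Plan", "scale": "1:100"}')
    return json.loads(response)

def pipeline(drawing):
    region = detect_title_block(drawing)
    # The real system would crop the drawing to region["bbox"] here.
    metadata = extract_metadata(region)
    metadata["bbox"] = region["bbox"]
    return metadata

result = pipeline("example_drawing.png")
```

The structured dictionary returned here is the kind of drawing metadata the paper indexes for search, filtering, and grouping.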
Making History Readable
Banerjee, Bipasha, Goyne, Jennifer, Ingram, William A.
The Virginia Tech University Libraries (VTUL) Digital Library Platform (DLP) hosts digital collections that offer our users access to a wide variety of documents of historical and cultural importance. These collections are not only of academic importance but also provide our users with a glance at local historical events. Our DLP contains collections comprising digital objects featuring complex layouts, faded imagery, and hard-to-read handwritten text, which makes providing online access to these materials challenging. To address these issues, we integrate AI into our DLP workflow and convert the text in the digital objects into a machine-readable format. To enhance the user experience with our historical collections, we use custom AI agents for handwriting recognition, text extraction, and large language models (LLMs) for summarization. This poster highlights three collections focusing on handwritten letters, newspapers, and digitized topographic maps. We discuss the challenges with each collection and detail our approaches to address them. Our proposed methods aim to enhance the user experience by making the contents in these collections easier to search and navigate.
- North America > United States > Virginia > Montgomery County > Blacksburg (0.05)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
Enhancing Steganographic Text Extraction: Evaluating the Impact of NLP Models on Accuracy and Semantic Coherence
Li, Mingyang, Yuan, Maoqin, Li, Luyao, Pengsihua, Han
This study discusses a new method combining image steganography technology with Natural Language Processing (NLP) large models, aimed at improving the accuracy and robustness of extracting steganographic text. Traditional Least Significant Bit (LSB) steganography techniques face challenges in accuracy and robustness of information extraction when dealing with complex character encoding, such as Chinese characters. To address this issue, this study proposes an innovative LSB-NLP hybrid framework. This framework integrates the advanced capabilities of NLP large models, such as error detection, correction, and semantic consistency analysis, as well as information reconstruction techniques, thereby significantly enhancing the robustness of steganographic text extraction. Experimental results show that the LSB-NLP hybrid framework excels in improving the extraction accuracy of steganographic text, especially in handling Chinese characters. The findings of this study not only confirm the effectiveness of combining image steganography technology and NLP large models but also propose new ideas for research and application in the field of information hiding. The successful implementation of this interdisciplinary approach demonstrates the great potential of integrating image steganography technology with natural language processing technology in solving complex information processing problems.
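The LSB embedding and extraction the abstract builds on can be shown in a few lines. This is a toy round trip in pure Python: a bytearray stands in for image pixel data (real systems write into the colour channels of an image), and the paper's LLM-based error detection and correction stage is omitted.

```python
# Toy Least Significant Bit (LSB) steganography round trip.

def embed(pixels, message):
    payload = message.encode("utf-8")
    bits = "".join(f"{byte:08b}" for byte in payload)
    stego = bytearray(pixels)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & 0xFE) | int(bit)  # overwrite the LSB only
    return stego

def extract(pixels, n_bytes):
    bits = "".join(str(p & 1) for p in pixels[:n_bytes * 8])
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")

cover = bytearray(range(256)) * 4       # 1024 "pixels"
stego = embed(cover, "hi 你好")          # 9 UTF-8 bytes -> 72 bits
recovered = extract(stego, 9)
```

The multi-byte UTF-8 encoding of the Chinese characters is exactly why a single flipped bit can corrupt a whole character, which is the failure mode the paper's NLP correction stage targets.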
Detect, Retrieve, Comprehend: A Flexible Framework for Zero-Shot Document-Level Question Answering
McDonald, Tavish, Tsan, Brian, Saini, Amar, Ordonez, Juanita, Gutierrez, Luis, Nguyen, Phan, Mason, Blake, Ng, Brenda
Researchers produce thousands of scholarly documents containing valuable technical knowledge. The community faces the laborious task of reading these documents to identify, extract, and synthesize information. To automate information gathering, document-level question answering (QA) offers a flexible framework where human-posed questions can be adapted to extract diverse knowledge. Finetuning QA systems requires access to labeled data (tuples of context, question and answer). However, data curation for document QA is uniquely challenging because the context (i.e. answer evidence passage) needs to be retrieved from potentially long, ill-formatted documents. Existing QA datasets sidestep this challenge by providing short, well-defined contexts that are unrealistic in real-world applications. We present a three-stage document QA approach: (1) text extraction from PDF; (2) evidence retrieval from extracted texts to form well-posed contexts; (3) QA to extract knowledge from contexts to return high-quality answers -- extractive, abstractive, or Boolean. Using QASPER for evaluation, our detect-retrieve-comprehend (DRC) system achieves a +7.19 improvement in Answer-F1 over existing baselines while delivering superior context selection. Our results demonstrate that DRC holds tremendous promise as a flexible framework for practical scientific document QA.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > Merced County > Merced (0.04)
- North America > Dominican Republic (0.04)
- Africa > Ethiopia > Addis Ababa > Addis Ababa (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.72)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.53)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.47)
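The retrieval stage (stage 2) of the three-stage pipeline above can be illustrated with a toy scorer: rank extracted passages by token overlap with the question. This bag-of-words heuristic is only a stand-in for the learned retriever the DRC system actually uses; the passages and question here are invented.

```python
# Toy evidence retrieval: pick the passage sharing the most tokens
# with the question.
import re

def tokens(text):
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question, passages, k=1):
    q = tokens(question)
    return sorted(passages, key=lambda p: len(q & tokens(p)),
                  reverse=True)[:k]

passages = [
    "The model is trained on QASPER with a cross-entropy objective.",
    "We report Answer-F1 on the QASPER test split.",
    "Related work covers open-domain question answering.",
]
best = retrieve("Which metric is reported on the test split?", passages)
```

The selected passage would then be handed to the comprehension stage (stage 3) as the context for extractive, abstractive, or Boolean answering.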
4 must-try new features in Windows 11's huge 2023 Update
Windows 11's 2023 Update is here, bringing with it a number of new features to explore. But which ones are worth trying? We've listed our favorites below. Windows 11's 2023 Update is (eventually) being pushed to your PC as a free, cumulative update, which means that it encompasses features and applications that may have already arrived on your PC. Windows 11 users will receive most of the 2023 Update features by Nov. 14, though it may take longer for some systems.
Adapting the Tesseract Open-Source OCR Engine for Tamil and Sinhala Legacy Fonts and Creating a Parallel Corpus for Tamil-Sinhala-English
Vasantharajan, Charangan, Tharmalingam, Laksika, Thayasivam, Uthayasanker
Most low-resource languages do not have the necessary resources to create even a substantial monolingual corpus. These languages may often be found in government proceedings, but mainly in Portable Document Format (PDF) files that contain legacy fonts. Extracting text from these documents to create a monolingual corpus is challenging due to legacy font usage and printer-friendly encoding, which are not optimized for text extraction. Therefore, we propose a simple, automatic, and novel idea that can scale across Tamil, Sinhala, and English, and across many documents, along with parallel corpora. Since Tamil and Sinhala are low-resource languages, we improved the performance of Tesseract by employing LSTM-based training on more than 20 legacy fonts to recognize printed characters in these languages. In particular, our model detects code-mixed text, numbers, and special characters in the printed documents. We show that this approach reduces the character-level error rate of Tesseract from 6.03 to 2.61 for Tamil (a reduction of 3.42 percentage points) and from 7.61 to 4.74 for Sinhala (2.87 points), as well as the word-level error rate from 39.68 to 20.61 for Tamil (19.07 points) and from 35.04 to 26.58 for Sinhala (8.46 points) on the test set. Also, our newly created parallel corpus consists of 185.4k, 168.9k, and 181.04k sentences and 2.11M, 2.22M, and 2.33M words in Tamil, Sinhala, and English respectively. This study shows that fine-tuning Tesseract models on multiple new fonts helps them recognize the texts and enhances OCR performance. We have made the newly trained models and the source code for fine-tuning Tesseract freely available.
- Asia > Sri Lanka > Western Province > Colombo > Colombo (0.05)
- North America > United States > New Mexico > Santa Fe County > Santa Fe (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.56)
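The character error rates quoted in the abstract above are edit (Levenshtein) distance divided by reference length. A minimal implementation for checking OCR output against ground truth, independent of any OCR library:

```python
# Character error rate (CER) = Levenshtein distance / reference length.

def levenshtein(ref, hyp):
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    return levenshtein(reference, hypothesis) / len(reference)
```

The word-level error rate the paper also reports is the same formula applied to lists of words rather than characters.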
Is Data a Differentiator for Your Business? If So, Traditional OCR Cannot Be An Answer - insideBIGDATA
If your business is driven by data, Optical Character Recognition (OCR) -- as most of us know it -- is not the answer. For those of you who view OCR as an industry staple for document processing, let me explain. OCR as a technology has been around for ages and it still has its place in processing unstructured document formats like PDFs, images, and other text formats that cannot be edited digitally. Users can quickly convert those files into editable documents. In short, it's a terrific technology for enabling you to edit and search for files that may have been "frozen."
Text Extraction in Python with Neural Networks
Image capture takes a snapshot in time of a person, place, or object. Cameras are built into many everyday devices, and taking pictures has become part of daily life. When a picture is taken, the device often recognizes the scene and applies an automatic correction. Optical Character Recognition (OCR) takes this further: it can take a picture of text and produce a usable file that matches the original document.
AI-Powered OCR -- Laying Groundwork for Automation? - DZone AI
We arguably live in an era of remarkable technological disruption. The world is becoming more digitalized and businesses are going digital with it; the recent pandemic in particular made us realize the importance of digitization and global connectivity. As a result, countless physical documents have been digitized using advanced technologies. One of these is Optical Character Recognition (OCR).
Negative Statements Considered Useful
Arnaout, Hiba, Razniewski, Simon, Weikum, Gerhard
Knowledge bases (KBs), pragmatic collections of knowledge about notable entities, are an important asset in applications such as search, question answering and dialogue. Rooted in a long tradition in knowledge representation, all popular KBs only store positive information, while they abstain from taking any stance towards statements not contained in them. In this paper, we make the case for explicitly stating interesting statements which are not true. Negative statements would be important to overcome current limitations of question answering, yet due to their potential abundance, any effort towards compiling them needs a tight coupling with ranking. We introduce two approaches towards compiling negative statements. (i) In peer-based statistical inferences, we compare entities with highly related entities in order to derive potential negative statements, which we then rank using supervised and unsupervised features. (ii) In query-log-based text extraction, we use a pattern-based approach for harvesting search engine query logs. Experimental results show that both approaches hold promising and complementary potential. Along with this paper, we publish the first datasets on interesting negative information, containing over 1.1M statements for 100K popular Wikidata entities.
- North America > United States (0.14)
- South America > Chile (0.14)
- Oceania > New Zealand (0.04)
- (6 more...)
- Research Report (1.00)
- Personal > Honors (1.00)
- Leisure & Entertainment (1.00)
- Health & Medicine (1.00)
- Media > Film (0.68)
- Government > Regional Government (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.68)
- Information Technology > Communications > Web > Semantic Web (0.68)
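The peer-based statistical inference described in the abstract above can be sketched with toy data: properties that most highly related peer entities have, but the target entity lacks, become candidate negative statements, ranked here simply by peer frequency. The entities and properties are invented, and the paper's supervised and unsupervised ranking features are reduced to this single frequency score.

```python
# Toy peer-based inference of candidate negative statements.
from collections import Counter

def candidate_negatives(entity_props, peer_prop_sets):
    counts = Counter(p for peers in peer_prop_sets for p in peers)
    candidates = [(p, n / len(peer_prop_sets))
                  for p, n in counts.items() if p not in entity_props]
    return sorted(candidates, key=lambda x: -x[1])

peers = [
    {"award:Nobel Prize", "occupation:physicist"},
    {"award:Nobel Prize", "occupation:chemist"},
    {"occupation:physicist"},
]
entity = {"occupation:physicist"}
ranked = candidate_negatives(entity, peers)
```

Here two of three peers hold a Nobel Prize while the target entity does not, so "did not win a Nobel Prize" surfaces as the top-ranked candidate negative statement.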